General Questions

When presented with a new dataset or database, what steps do you generally take to evaluate it prior to working with it?

Before I explore a new dataset, I consider the primary questions we want to answer and how stakeholders will consume the analysis. I then look at the underlying structure of the data tables and data pipelines, which helps me understand how the data is collected and stored. Next, I look for primary keys and matching fields in other data tables that might be relevant to the problem at hand. This is useful when joining tables, for instance, patient data with treatment data. Finally, I clean, format, and normalize the data. If I am uncertain about the data structure or the data itself, I ask questions.
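As a rough illustration of these first checks, the sketch below uses base R on two small made-up tables. The table names, columns, and values are hypothetical stand-ins, not the attached dataset.

```r
# Hypothetical toy tables standing in for the real diagnosis and treatment files.
diagnosis <- data.frame(
  patient_id = c(1, 2, 2, 3),
  diagnosis_date = c("1/5/10", "2/1/10", "2/1/10", "3/9/10"),
  diagnosis_code = c("174.9", "153.4", "153.5", "174.1"),
  stringsAsFactors = FALSE
)
treatment <- data.frame(
  patient_id = c(1, 1, 3, 4),
  treatment_date = c("1/9/10", "2/9/10", NA, "4/1/10"),
  stringsAsFactors = FALSE
)

# Column types and a preview of the values.
str(diagnosis)

# Missing values per column.
na_counts <- colSums(is.na(treatment))

# Is patient_id a unique key? List the ids that appear more than once.
dup_ids <- unique(diagnosis$patient_id[duplicated(diagnosis$patient_id)])

# Treatment rows whose patient_id has no matching diagnosis row.
unmatched <- setdiff(treatment$patient_id, diagnosis$patient_id)
```

Checks like these indicate whether a join on patient_id will duplicate or drop rows before any analysis is run.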

Based on the information provided above and the attached dataset, what three questions would you like to understand prior to conducting any analysis of the data?

As a statistician, my questions are: Were the patients randomly selected? Can we generalize results from this experiment to a larger population of patients? As a data scientist, I want to know: Where did the data come from? Has the data been normalized? As a project stakeholder, I ask: What problem do we want to address, and who are the consumers of the analysis?

Data Analysis

Question 1

To determine the distribution of cancer types across patients, we counted the records under each diagnosis code and diagnosis category. Colon cancer diagnosis codes begin with 153 and breast cancer diagnosis codes begin with 174. We found 39 breast cancer diagnosis records and 18 colon cancer diagnosis records. Because table() counts rows, a patient with more than one diagnosis record is counted more than once, so these figures are upper bounds on the number of unique patients. We’ve printed the count under each diagnosis code below.

#Read in data
DiagnosisDat <- read.csv(file("/Users/anniezhang/Flatiron/Patient_Diagnosis.csv"))
TreatmentDat <- read.csv(file("/Users/anniezhang/Flatiron/Patient_Treatment.csv"))
table(DiagnosisDat$diagnosis_code)
## 
## 153.3 153.4 153.5 153.6 153.7 153.8 153.9 174.1 174.2 174.3 174.4 174.5 
##     3     5     4     1     1     1     3     5     1     3     1     3 
## 174.6 174.7 174.8 174.9 
##     1     4     5    16
table(DiagnosisDat$diagnosis)
## 
## Breast Cancer  Colon Cancer 
##            39            18
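The counts above are per record. A minimal sketch of counting unique patients per diagnosis instead, using base R on a made-up table (the values are hypothetical; only the column names mirror DiagnosisDat):

```r
# Hypothetical toy data: patients 1 and 3 each have two breast cancer records.
toy <- data.frame(
  patient_id = c(1, 1, 2, 3, 3),
  diagnosis = c("Breast Cancer", "Breast Cancer", "Colon Cancer",
                "Breast Cancer", "Breast Cancer"),
  stringsAsFactors = FALSE
)

# Rows per diagnosis (what table() reports).
record_counts <- table(toy$diagnosis)

# Distinct patient ids per diagnosis.
patient_counts <- tapply(toy$patient_id, toy$diagnosis,
                         function(ids) length(unique(ids)))
```

Comparing record_counts with patient_counts shows how far repeated records inflate the per-diagnosis totals.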

Question 2

We analyzed the data and calculated the number of days between diagnosis date and earliest date of treatment for each patient. Our cleaned data set includes 52 unique diagnosis cases.

We identified one case, patient_id 4256, with a null treatment date for their diagnosis. We assume this patient opted out of receiving treatment and exclude the case from the analysis of days between diagnosis and treatment. We acknowledge that this assumption may introduce bias and may revisit it in a later study. Future analysis should examine why patients do not receive treatment and forecast whether a patient will pursue treatment.

We also notice one colon cancer case where 304 days elapsed between diagnosis and treatment. After looking at our data, we treat this case as an outlier and exclude it from our analysis.

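#Packages used below: lubridate for mdy(), dplyr for joins and %>%
library(lubridate)
library(dplyr)
DiagnosisDat$diagnosis_date <- mdy(DiagnosisDat$diagnosis_date)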
TreatmentDat$treatment_date <- mdy(TreatmentDat$treatment_date)

CombinedDat <- left_join(x = DiagnosisDat, y = TreatmentDat, by = "patient_id")
#Use min() so the earliest date does not depend on the row order of the input.
CleanDat <- CombinedDat %>%
  group_by(patient_id, diagnosis_date, diagnosis) %>%
  summarise(earliest_date = min(treatment_date))
CleanDat$date_diff <- CleanDat$earliest_date - CleanDat$diagnosis_date
#Remove NULL value. See assumption above.
CleanDat <- na.omit(CleanDat)
which(CleanDat$date_diff==304)
## [1] 10
#Remove the colon cancer outlier (row 10 above) by value rather than by position.
CleanDat <- CleanDat[CleanDat$date_diff != 304, ]

After we remove the two cases above, our data contains 50 cases, each with a patient id, diagnosis date, diagnosis, and earliest treatment date. We take the difference between earliest treatment date and diagnosis date to determine how long after being diagnosed patients start treatment. On average, breast cancer patients start treatment 4.86 days after diagnosis, with a variance of 19.18 days² (a standard deviation of about 4.4 days). In comparison, colon cancer patients start treatment 4.2 days after diagnosis on average, with a variance of 8.46 days² (a standard deviation of about 2.9 days).

head(CleanDat,5)
## # A tibble: 5 x 5
## # Groups:   patient_id, diagnosis_date [5]
##   patient_id diagnosis_date diagnosis     earliest_date date_diff
##        <int> <date>         <fct>         <date>        <drtn>   
## 1       2038 2010-01-21     Breast Cancer 2010-01-24     3 days  
## 2       2120 2010-01-09     Breast Cancer 2010-01-23    14 days  
## 3       2175 2010-02-17     Breast Cancer 2010-02-21     4 days  
## 4       2238 2010-01-21     Breast Cancer 2010-01-21     0 days  
## 5       2407 2010-06-13     Breast Cancer 2010-06-19     6 days
## Breast Cancer mean and variance
## Time difference of 4.857143 days
## [1] 19.18487
## Colon Cancer mean and variance
## Time difference of 4.2 days
## [1] 8.457143
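The code that produced the mean and variance output is not shown. A minimal base R sketch of one way to compute per-diagnosis summaries follows. The toy rows are hypothetical and only mirror the column names of CleanDat.

```r
# Hypothetical toy data with the same columns as CleanDat.
toy <- data.frame(
  diagnosis = c("Breast Cancer", "Breast Cancer", "Breast Cancer",
                "Colon Cancer", "Colon Cancer"),
  diagnosis_date = as.Date(c("2010-01-21", "2010-01-09", "2010-02-17",
                             "2010-03-01", "2010-03-10")),
  earliest_date = as.Date(c("2010-01-24", "2010-01-23", "2010-02-21",
                            "2010-03-03", "2010-03-16")),
  stringsAsFactors = FALSE
)

# Convert the difftime to plain numbers so the units are explicit (days).
toy$date_diff <- as.numeric(toy$earliest_date - toy$diagnosis_date, units = "days")

# Mean (days) and variance (days squared) per diagnosis.
group_means <- tapply(toy$date_diff, toy$diagnosis, mean)
group_vars <- tapply(toy$date_diff, toy$diagnosis, var)
```

Taking sqrt(group_vars) gives standard deviations in days, which are easier to compare against the means than variances.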

Breast cancer and colon cancer patients wait a similar number of days, on average, between diagnosis and treatment. However, the spread in wait time is wider for breast cancer patients, whose variance is more than twice as large. Breast cancer patients vary more in when they start treatment after diagnosis.

Question 3

For breast cancer, we found that patients followed first-line treatment regimens of chemotherapy drugs A, B, AB in combination, and immunotherapy drug C. Of the six breast cancer patients using drug C, all patients used drug C alone without combining with other drugs. Breast cancer patients did not use drug D as part of their treatment regimen.

For colon cancer, we found that patients followed first-line treatment regimens A, B, AB, C, and D. As was the case for breast cancer patients, colon cancer patients who used immunotherapy drug C as a first-line treatment did not combine it with any other drugs. Colon cancer patients who used drug D, an immunotherapy drug, as a first-line treatment also did not combine it with any other drugs.

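We show the counts of each drug code among first-line treatment records below. These are counts per drug code, so a patient on the AB combination contributes to both the A and B rows.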

#CleanDat already has its NA rows removed (see Question 2).
DrugDat <- left_join(x = CleanDat, y = TreatmentDat, by = c("patient_id" = "patient_id", "earliest_date" = "treatment_date"))

DrugDat %>%
  group_by(diagnosis) %>%
  count(drug_code)
## # A tibble: 7 x 3
## # Groups:   diagnosis [2]
##   diagnosis     drug_code     n
##   <fct>         <fct>     <int>
## 1 Breast Cancer A            22
## 2 Breast Cancer B            34
## 3 Breast Cancer C             6
## 4 Colon Cancer  A             3
## 5 Colon Cancer  B             8
## 6 Colon Cancer  C             4
## 7 Colon Cancer  D             4
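To count regimens rather than individual drug codes, the drug codes given to each patient on the first treatment date can be collapsed into one label. The sketch below is one possible approach in base R. The toy rows are hypothetical and stand in for DrugDat.

```r
# Hypothetical first-line records: one row per patient and drug.
first_line <- data.frame(
  patient_id = c(1, 1, 2, 3, 4, 4),
  drug_code = c("A", "B", "C", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Collapse each patient's drug codes into a single sorted regimen label,
# so that "A" + "B" and "B" + "A" both become "AB".
regimen <- tapply(first_line$drug_code, first_line$patient_id,
                  function(codes) paste(sort(unique(codes)), collapse = ""))

# Number of patients per regimen.
regimen_counts <- table(regimen)
```

With this shape, combination therapy appears as its own regimen instead of being split across two drug rows.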

For breast cancer, chemotherapy drugs A and B are first-line treatments independently and in combination. Our data indicate that immunotherapy drug C can be used as a first-line treatment when not combined with any other drugs. We conjecture that these patients may have been diagnosed in later stages of their cancer, when chemotherapy drugs will not work. For colon cancer, our data show that regimens A and B are first-line treatments, independently and sometimes combined. Patients also used immunotherapy drugs C or D as first-line treatments, alone and without other drugs.

Question 4

CleanDat <- CombinedDat %>%
  group_by(patient_id,diagnosis_date,diagnosis) %>%
  filter(diagnosis=="Breast Cancer") %>%
  #min() and max() do not depend on the row order of treatment_date.
  summarise(earliest_date = min(treatment_date), latest_date = max(treatment_date))
CleanDat <- na.omit(CleanDat)
CleanDat$duration <- CleanDat$latest_date - CleanDat$earliest_date

DurationDat <- inner_join(x=CleanDat,y=TreatmentDat,by = c("patient_id" = "patient_id","latest_date" = "treatment_date"))
DurationDat <- na.omit(DurationDat)
head(DurationDat,5)
## # A tibble: 5 x 7
## # Groups:   patient_id, diagnosis_date [3]
##   patient_id diagnosis_date diagnosis earliest_date latest_date duration
##        <int> <date>         <fct>     <date>        <date>      <drtn>  
## 1       2038 2010-01-21     Breast C… 2010-01-24    2017-02-20  2584 da…
## 2       2120 2010-01-09     Breast C… 2010-01-23    2010-03-02    38 da…
## 3       2120 2010-01-09     Breast C… 2010-01-23    2010-03-02    38 da…
## 4       2175 2010-02-17     Breast C… 2010-02-21    2010-04-03    41 da…
## 5       2175 2010-02-17     Breast C… 2010-02-21    2010-04-03    41 da…
## # … with 1 more variable: drug_code <fct>
A_all <- as.numeric(DurationDat$duration[which(DurationDat$drug_code=="A")])
B_all <- as.numeric(DurationDat$duration[which(DurationDat$drug_code=="B")])

Patients following regimen A have a mean treatment duration of 242.4 days. We grouped treatment durations into 10-day intervals. The distribution for regimen A is centered at the median of 64.5 days. We identify one patient who undergoes therapy for 2584 days. Because the mean is sensitive to large outliers, we recommend the median as a better measure of central tendency here. The typical treatment duration for regimen A is therefore approximately 64.5 days.

Patients following regimen B have a mean treatment duration of 74.7 days. We looked at the distribution of durations, again bucketing into 10-day intervals. The distribution is bimodal, with one center at 0 days and another at 56.5 days. We noticed one outlier with a treatment duration of 1001 days. We recommend reporting the median of each subgroup separately: one where the duration is 0 days, and one where patients undergo treatment for multiple days. The distribution for patients following regimen B for 1 or more days is centered at the median of 56.5 days.
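The effect of a single extreme value on the mean versus the median can be seen in a small example. The durations below are hypothetical, chosen only to include one value as large as the 2584-day case.

```r
# Hypothetical durations in days, with one extreme value.
durations <- c(38, 51, 64, 65, 77, 2584)

mean_all <- mean(durations)       # pulled far upward by the extreme value
median_all <- median(durations)   # middle of the sorted values

# Dropping the extreme value moves the mean a lot and the median very little.
mean_trimmed <- mean(durations[durations < 1000])
median_trimmed <- median(durations[durations < 1000])
```

Here the mean drops from about 480 days to 59 days once the extreme value is removed, while the median only moves from 64.5 to 64.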

###REGIMEN A
mean(A_all)
## [1] 242.4286
summary(A_all)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   38.00   51.25   64.50  242.43   77.00 2584.00
####REGIMEN B
mean(B_all)
## [1] 74.72727
summary(B_all)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   39.50   74.73   58.25 1001.00
#Median for nonzero durations (select by value rather than hard-coded positions)
median(B_all[B_all > 0])
## [1] 56.5
#Time buckets. cases() is assumed to come from the memisc package; the first matching condition wins.
y1 <- memisc::cases(
    "days<40"=A_all <40,
    "40<days<50"=A_all <50,
    "50<days<60"=A_all <60,
    "60<days<70"=A_all <70,
    "70<days<80"=A_all <80,
    "80<days<90"=A_all <90,
    "90<days<100"=A_all <100,
    "days>100"=TRUE
    )
y2 <- memisc::cases(
    "days<10"=B_all <10,
    "10<days<20"=B_all <20,
    "20<days<30"=B_all <30,
    "30<days<40"=B_all <40,
    "40<days<50"=B_all <50,
    "50<days<60"=B_all <60,
    "60<days<70"=B_all <70,
    "70<days<80"=B_all <80,
    "days>80"=TRUE
    )

plot(y1,main="Duration Distribution for Regimen A",ylab="Number of Patients")

plot(y2,main="Duration Distribution for Regimen B",ylab="Number of Patients")
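As an alternative to listing each bucket by hand, base R's cut() can build the same kind of 10-day intervals from a vector of break points. This is a sketch on hypothetical durations, not the code used for the plots above.

```r
# Hypothetical durations in days.
durations <- c(0, 0, 38, 45, 56, 57, 64, 77, 1001)

# Break points every 10 days up to 80, then one open-ended bucket.
breaks <- c(seq(0, 80, by = 10), Inf)

# right = FALSE makes each interval closed on the left, e.g. [0,10), [10,20).
buckets <- cut(durations, breaks = breaks, right = FALSE)

# Number of durations in each bucket, including empty buckets.
bucket_counts <- table(buckets)
```

Calling plot(buckets) then draws a bar chart of the counts, and the interval labels state explicitly which endpoint each bucket includes.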

#Keep only the records for drugs A and B.
AB_combined <- DurationDat %>%
  ungroup() %>%
  filter(drug_code %in% c("A", "B"))

#Plain data frame with the regimen as a factor and the duration in days.
dat <- data.frame(code = factor(AB_combined$drug_code),
                  duration = as.numeric(AB_combined$duration))
#Drop the two extreme durations (2584 and 1001 days) for plotting.
newdat <- dat[!(dat$duration %in% c(2584, 1001)), ]

We show the spread of regimens A and B with side-by-side boxplots in the plot below. We can see that regimen B has a larger spread, with a wider interquartile range than regimen A. We perform an Analysis of Variance (ANOVA) test to determine whether patients following regimen A versus regimen B as first-line therapy for breast cancer have a statistically significant difference in duration of therapy.

################ SIDE BY SIDE BOXPLOTS ###################
library(ggplot2)
library(plotly)
#fun and guides(fill = "none") replace the deprecated fun.y and guides(fill=FALSE).
p <- ggplot(newdat, aes(x=code, y=duration, fill=code)) +
  ggtitle("Therapy Duration by Regimen") +
  geom_boxplot() + guides(fill="none") + coord_flip() +
  stat_summary(fun=mean, geom="point", shape=5, size=4)
p <- ggplotly(p)
p

From our ANOVA test, we obtained a p-value of 0.281. At the 5% significance level, we fail to reject the null hypothesis that patients following regimen A as first-line therapy for breast cancer have the same mean therapy duration as patients following regimen B. We do not have sufficient evidence to claim that therapy durations differ between regimens A and B. Note that the model is fit on dat, which still includes the 2584-day and 1001-day outliers that were removed for the boxplots.

####################### ANOVA TEST ############################
m <- lm(duration~code,dat)
summary(m)
## 
## Call:
## lm(formula = duration ~ code, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -204.43 -166.18  -74.73  -23.48 2341.57 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    242.4      119.8   2.024   0.0509 .
## codeB         -167.7      153.2  -1.095   0.2814  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 448.1 on 34 degrees of freedom
## Multiple R-squared:  0.03404,    Adjusted R-squared:  0.005633 
## F-statistic: 1.198 on 1 and 34 DF,  p-value: 0.2814
anova(m)
## Analysis of Variance Table
## 
## Response: duration
##           Df  Sum Sq Mean Sq F value Pr(>F)
## code       1  240614  240614  1.1983 0.2814
## Residuals 34 6827232  200801
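With only two groups, the one-way ANOVA F test is equivalent to a pooled-variance two-sample t test: the F statistic equals the squared t statistic and the p-values match. In the output above, (-1.095)² ≈ 1.199 agrees with the F value of 1.198 up to rounding. A small sketch on hypothetical data:

```r
# Hypothetical durations for two regimens.
toy <- data.frame(
  code = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  duration = c(38, 51, 64, 77, 0, 40, 57, 60)
)

fit <- lm(duration ~ code, data = toy)

# F statistic and p-value from the ANOVA table.
f_stat <- anova(fit)[["F value"]][1]
f_pval <- anova(fit)[["Pr(>F)"]][1]

# t statistic for the regimen B coefficient from the same model.
t_stat <- summary(fit)$coefficients["codeB", "t value"]

# Pooled-variance two-sample t test on the same data.
pooled <- t.test(duration ~ code, data = toy, var.equal = TRUE)
```

Both tests assume equal variances and roughly normal residuals, so extreme durations affect them in the same way.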